BUG: Timedelta not formatted correctly in to_json #28595


Closed
cbertinato wants to merge 2 commits

Conversation

cbertinato (Contributor)

WillAyd added the IO JSON (read_json, to_json, json_normalize) label on Sep 24, 2019
WillAyd (Member) left a comment:

Thanks for the PR


if is_timedelta64_dtype(obj.dtype) and self.date_format == "iso":
    obj = obj.copy()
    self.obj = obj.apply(lambda x: x.isoformat())
WillAyd (Member):

Hmm, not sure about this approach - this doesn't work if the timedelta is contained within an object array, right?

cbertinato (Contributor, Author):

You're right, it won't. But the modifications in objToJSON.c should take care of that case.

cbertinato (Contributor, Author):

Which reminds me, there should probably also be a test for the astype(object) case. I will add.
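
A minimal sketch of what such an object-dtype test might look like, assuming the fix is in place (the sample values and expected ISO strings are illustrative, not the exact test added in this PR):

import json

import pandas as pd
from pandas import Series


def test_timedelta_as_object_to_json():
    # Cast to object dtype so the values reach the serializer as pd.Timedelta objects
    s = Series(pd.to_timedelta(["1 days", "2 days"])).astype(object)
    result = json.loads(s.to_json(date_format="iso"))
    # pd.Timedelta("1 days").isoformat() == "P1DT0H0M0S"
    assert result == {"0": "P1DT0H0M0S", "1": "P2DT0H0M0S"}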

WillAyd (Member):

The question, ultimately, is whether we need to do this at the Python level at all. It seems to negate the need for the change to the extension module, since when ISO format is requested these values would already be strings by the time they reach the serializer, not actual timedelta objects.

FWIW this is going to be rather non-performant, especially for large DataFrames

cbertinato (Contributor, Author):

Well, the Python-level change doesn't entirely negate the need for the change in the extension module.

If one does df.astype(object).to_json(date_format="iso"), then this code in _json.py is not executed and we instead go through the modifications made in objToJSON.c.

So in the case where we pass an object array into the extension module, we can take care of the serialization in C. In the case where we attempt to serialize the DataFrame directly, timedeltas appear differently: they come into the C code as NumPy timedeltas, for which there is no method to serialize as ISO. We could perhaps convert to a pandas Timedelta in the C code and then use its isoformat() method inside the extension module.

I went with this python-level change based on earlier discussions we had about this issue. In particular: "This would mean dispatching to vectorized methods in Python where applicable (i.e. a DatetimeArray is converted to an array of ISO strings before getting to the serializer)..."

We do end up with a less performant result in the case that we directly serialize the DataFrame but may gain in simplicity. The fix for the object dtype case should still be good performance-wise.

That said, I would be amenable to attempting a fix contained entirely within the extension module.
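
For illustration, the two code paths discussed above could be exercised roughly like this (a sketch, not the PR's own test code):

import pandas as pd

df = pd.DataFrame({"td": pd.timedelta_range("1 day", periods=3)})

# timedelta64 dtype: handled by the Python-level change in _json.py,
# which pre-converts the values to ISO strings via isoformat()
df.to_json(date_format="iso")

# object dtype: the values arrive in objToJSON.c as pd.Timedelta objects,
# so the C-level change has to handle the serialization itself
df.astype(object).to_json(date_format="iso")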

WillAyd (Member):

Hmm, OK. Yeah, this is a weird in-between. Maybe OK for now knowing it fixes the bug, and we can worry about performance in a follow-up.

)
def test_series_timedelta_to_json(self, date_format, delta, formatted):
    # GH28156: to_json not correctly formatting Timedelta
    s = Series([pd.Timedelta(d) for d in delta])
WillAyd (Member):

Should be able to use pd.timedelta_range
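
For example, the fixture could presumably be built along these lines, using the same Series/pd imports as the test above (periods and frequency here are placeholders):

s = Series(pd.timedelta_range(start="1 day", periods=2, freq="D"))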

def test_series_timedelta_to_json(self, date_format, delta, formatted):
    # GH28156: to_json not correctly formatting Timedelta
    s = Series([pd.Timedelta(d) for d in delta])
    s.index = ["one", "two"]
WillAyd (Member):

Is modifying the index critical for the test? If not best to exclude

cbertinato (Contributor, Author):

I wouldn't say it's critical. I initially included it because a previous iteration had code that reset the index, which a test like this would catch. But if the changes are sufficiently well-constrained not to touch anything outside of what they should, then changing the index in the test isn't critical.

result = json.loads(s.to_json(date_format=date_format))
expected = {k: v for k, v in zip(s.index, formatted)}

assert result == expected
WillAyd (Member):

Hmm, can you make the test compare against the actual expected output? Right now I'm not sure this actually tests the date format; it could still always write out in epoch and just infer that back on the way in, and the test would still pass.
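
One way to do that would be to assert against the serialized string itself rather than round-tripping through json.loads; a rough sketch, with the expected literal assumed for a two-element daily Timedelta series:

s = Series(pd.to_timedelta(["1 days", "2 days"]))
result = s.to_json(date_format="iso")
# With ISO formatting in effect the raw output should contain duration strings,
# not epoch integers (exact expected value assumed for illustration)
assert result == '{"0":"P1DT0H0M0S","1":"P2DT0H0M0S"}'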

cbertinato (Contributor, Author):

Fixed.

WillAyd (Member) left a comment:

Could you add a benchmark that contains time deltas to track performance?
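
A plausible asv-style benchmark for this might look roughly as follows (the class, parameters, and data shapes are illustrative, not the ones ultimately added to the pandas benchmarks):

import numpy as np
import pandas as pd


class ToJSONTimedelta:
    params = [["split", "columns", "index", "values", "records"]]
    param_names = ["orient"]

    def setup(self, orient):
        N = 10 ** 5
        self.df = pd.DataFrame(
            {"td": pd.to_timedelta(np.random.randint(1, 10 ** 6, N), unit="s")}
        )

    def time_to_json_iso(self, orient):
        self.df.to_json(date_format="iso", orient=orient)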


@@ -171,6 +171,34 @@ def _write(
class SeriesWriter(Writer):
    _default_orient = "index"

    def __init__(
WillAyd (Member):

Can you annotate these signatures?
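
For reference, the kind of annotations being asked for might look roughly like this (the parameter list is a guess based on to_json's keyword arguments, not the exact signature from the PR):

from typing import Callable, Optional


class SeriesWriter:  # in pandas this subclasses the Writer base class
    _default_orient = "index"

    def __init__(
        self,
        obj,
        orient: Optional[str],
        date_format: str,
        double_precision: int,
        ensure_ascii: bool,
        date_unit: str,
        index: bool,
        default_handler: Optional[Callable] = None,
        indent: int = 0,
    ) -> None:
        self.obj = obj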

@pep8speaks

pep8speaks commented Sep 26, 2019

Hello @cbertinato! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-05 16:30:59 UTC

cbertinato (Contributor, Author) commented Sep 26, 2019

There was an existing DataFrame that mixed timedeltas, datetimes, and ints, so just to be safe I added another with only timedeltas. I also added a benchmark with date_format="iso". Just about all DataFrames containing timedeltas showed a significant slowdown:

       before           after         ratio
     [4fb853f6]       [aa9749a2]
     <28256-iso-date-json~2>       <28256-iso-date-json>
+        71.4±2ms       3.55±0.07s    49.76  io.json.ToJSON.time_to_json_iso('records', 'df_td')
+        68.8±1ms       3.40±0.06s    49.40  io.json.ToJSON.time_to_json_iso('values', 'df_td')
+        93.4±5ms       3.53±0.06s    37.81  io.json.ToJSON.time_to_json_iso('split', 'df_td')
+         179±7ms       3.79±0.03s    21.14  io.json.ToJSON.time_to_json_iso('split', 'df_td_int_ts')
+        183±20ms       3.50±0.05s    19.11  io.json.ToJSON.time_to_json_iso('values', 'df_td_int_ts')
+         184±9ms       3.43±0.08s    18.62  io.json.ToJSON.time_to_json_iso('records', 'df_td_int_ts')
+        363±10ms       3.70±0.05s    10.17  io.json.ToJSON.time_to_json_iso('columns', 'df_td')
+        375±10ms       3.70±0.05s     9.86  io.json.ToJSON.time_to_json_iso('index', 'df_td')
+        482±20ms       3.92±0.08s     8.13  io.json.ToJSON.time_to_json_iso('index', 'df_td_int_ts')
+         483±9ms       3.85±0.03s     7.96  io.json.ToJSON.time_to_json_iso('columns', 'df_td_int_ts')
+        116±10ms         482±30ms     4.16  io.json.ToJSON.time_to_json_iso('split', 'df_int_floats')
+        85.8±5ms         336±40ms     3.91  io.json.ToJSON.time_to_json_iso('values', 'df_int_floats')
+        120±10ms         455±30ms     3.80  io.json.ToJSON.time_to_json_iso('records', 'df_int_floats')
+        410±10ms         976±20ms     2.38  io.json.ToJSON.time_to_json_iso('index', 'df_int_floats')
+        420±10ms         982±20ms     2.34  io.json.ToJSON.time_to_json_iso('columns', 'df_int_floats')

Interestingly, there are some significant changes for int-float DataFrames as well.

cbertinato (Contributor, Author):

I tried instead converting to a pandas Timedelta in C, which, performance-wise, is less bad than converting in Python, but still not that great:

       before           after         ratio
     [6396bc37]       [b97525a0]
     <master>         <28256-iso-date-json>
+        72.0±1ms       2.28±0.05s    31.61  io.json.ToJSON.time_to_json_iso('records', 'df_td')
+        68.3±4ms       2.14±0.01s    31.27  io.json.ToJSON.time_to_json_iso('values', 'df_td')
+        87.7±3ms       2.19±0.03s    25.01  io.json.ToJSON.time_to_json_iso('split', 'df_td')
+         148±2ms       2.30±0.01s    15.55  io.json.ToJSON.time_to_json_iso('values', 'df_td_int_ts')
+         174±7ms       2.37±0.04s    13.65  io.json.ToJSON.time_to_json_iso('split', 'df_td_int_ts')
+        179±10ms       2.40±0.01s    13.40  io.json.ToJSON.time_to_json_iso('records', 'df_td_int_ts')
+         351±9ms       2.42±0.04s     6.92  io.json.ToJSON.time_to_json_iso('columns', 'df_td')
+        356±10ms       2.45±0.01s     6.87  io.json.ToJSON.time_to_json_iso('index', 'df_td')
+         449±7ms       2.65±0.07s     5.89  io.json.ToJSON.time_to_json_iso('index', 'df_td_int_ts')
+         471±8ms       2.50±0.01s     5.30  io.json.ToJSON.time_to_json_iso('columns', 'df_td_int_ts')

I wonder whether we can ever do much better than this without a function that formats NumPy timedeltas directly.

WillAyd mentioned this pull request on Oct 9, 2019
jreback (Contributor) commented Nov 13, 2019

@cbertinato what is the status of this?

cbertinato (Contributor, Author):

Unless I'm missing another way to solve this, it looks like we take a performance hit. So I suppose the question is whether this is tolerable for this fix, or whether we should scrap this branch and go with a vectorized solution, something like Will's work in #28863.

jreback (Contributor) commented Nov 13, 2019

Instead of this approach, I would convert the timedeltas to longs (using vectorized methods on the Series) before passing them to the C-based dumper.
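
A rough sketch of that suggestion, assuming the "longs" are the underlying int64 nanosecond counts (the exact representation isn't specified here):

import pandas as pd

s = pd.Series(pd.timedelta_range("1 day", periods=3))

# Vectorized conversion on the Series before anything reaches the C serializer:
# timedelta64[ns] values converted to their underlying int64 nanosecond counts.
as_longs = s.astype("int64")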

WillAyd (Member) commented Dec 17, 2019

I think this should be easier to fix after #30283. I think it will just need a new timedelta-to-ISO method, though.

cbertinato (Contributor, Author):

In C?

WillAyd (Member) commented Dec 17, 2019

Right, I think it would be easier to do in C since the datetime code is already there.

cbertinato (Contributor, Author):

Sorry for the severe lag on this. Will pick it back up now that #30283 is in.

WillAyd (Member) commented Jan 10, 2020

@cbertinato I think I'm close on a C implementation, actually. Might be able to push it up soon to compare.

WillAyd (Member) commented Feb 12, 2020

Thanks for the PR! I think this will be superseded by #30903, though, so closing for now.

WillAyd closed this on Feb 12, 2020
Labels: IO JSON (read_json, to_json, json_normalize)

Development

Successfully merging this pull request may close these issues:

Timedelta in to_json object array and ISO dates not handled properly

4 participants